Abstract: One of the most challenging aspects of developing
information systems is the processing and management of
large volumes of information. One way to overcome this
problem is to implement efficient data indexing and
classification systems. As a large share of the generated data
consists of unstructured textual data, developing text
processing, management and indexing frameworks can play 
an important role in providing users with accurate 
information according to their preferences. In this paper, a 
novel method of semantic information processing, 
management and indexing is introduced. The main goals of
this study are to integrate structured knowledge of ontology
and Knowledge Bases (KBs) in the core components of the 
method, to enrich the contents of the documents, to have 
multi-level semantic network representation of textual 
resources, to introduce a hybrid weighting schema (salient 
score) and finally to propose a hybrid method of semantic 
similarity computation. The structured knowledge of
ontology and KBs is integrated into all aspects of the
proposed method. The obtained results indicate the accuracy 
and optimal performance of the proposed framework. The 
obtained results suggest that using knowledge-based models 
leads to higher performance and accuracy in identifying and 
classifying documents according to user preferences; 
however, if learning-based models are not provided with
a sufficient amount of training data, they cannot yield
satisfactory results. The results also demonstrate that the
complete integration of ontology and KBs in information 
systems can significantly contribute to a better representation 
of documents and evidently superior functionality of 
information processing, management and indexing systems. 


Keywords: Ontology, Knowledge Base, Semantic 
1. Introduction

A semantic information indexing and classification system
deals with finding the most suitable representation of the 
documents and the best approaches to differentiate between 
the relevant and irrelevant documents in any given 
information domain. The representation model specifies how 
the documents and queries should be represented. Usually, a 
defined similarity metric determines the most relevant 
documents to a given information domain. The majority of 
information indexing and classification systems use a very 
simple representation model for documents and queries 
called the bag-of-words model, which consists of a collection
of single-word linguistic units. These models usually employ 
the exact-term-matching methods to search for the most 
relevant documents. Such a representation model suffers 
from serious limitations which are documented in several 
research papers [1-7]. Most of these limitations are present 
due to the inherent ambiguity in the content and the 
incapability of these models to represent the context of 
documents. They also suffer from other problems, such as 
synonymy and polysemy; therefore, it is hard to describe
a user's precise information needs with keywords alone. So far,
several methods have been introduced to overcome these 
limitations and problems, and knowledge-based approaches 
are among such methods. These methods utilize the 
structured knowledge of ontology and KBs to produce a 
semantic representation of the documents and user queries 
and also to draw a comparison between them using semantic 
similarity methods. 

The knowledge-based methods [8-11] employ the
structured knowledge of ontology and KBs to compute the
true contextual meaning of words, to perform semantic
indexing and to identify the semantics in information
systems. In semantic indexing (i.e., the semantic
representation of documents), the
purpose is to extract or derive features and semantic 
structures that can describe the information content of the 
documents. Therefore, the main challenge is to determine a 
methodology for identifying the majority of relevant 
concepts and semantic structures while ignoring the 
irrelevant ones. 

One of the most significant aspects of the proposed
method is the semantic network representation of textual
resources. The semantic network generally consists of a
number of connected nodes (representing the
concepts/words in the document). These nodes are
connected via edges. The connecting links between nodes in
a semantic network represent the different relations between 
the concepts/words. The main idea is to extract every piece 
of useful and significant information about the information 
content from structured knowledge sources and generate a 
comprehensive representation of documents. The proposed 
method can be used in a number of IP&M-related 
applications, such as semantic indexing of the documents, 
document classification, topic spotting, personalized 
information filtering and recommender systems. 

Two major factors play an important role in the novelty
of the proposed system. Firstly, considering the synergistic
relationship between the different components of a text
processing, indexing and management system, the idea of
integrating structured knowledge into all components of the 
proposed system is presented. Secondly, there is a logical 
relationship between different components of the proposed 
system. In other words, a multi-layer module designed for 
feature extraction can identify the information structures 
contained in textual documents, and then content enrichment 
modules are introduced based on the extracted features. 
These modules, based on the extracted features, attempt to 
enrich and identify relevant information structures. The 
extracted features and enriched information structures are 
integrated into a graphical model (semantic networks). Also, 
the semantic similarity computation module calculates the 
similarity between documents and user preferences based on 
the extracted features, the identified information structures, 
and the generated representation model. Therefore, all 
modules and the processes embedded in them are designed 
in an articulate manner to implement a text-based 
information processing, indexing and management system. 

Such characteristics play an important role in the novelty of
the proposed method.

In summary, the novel contributions of the present paper are
as follows:

• The integration of structured knowledge of ontology and
KBs in every component of the proposed semantic
information indexing and management system.

• Utilizing the semantic networks for comprehensive
multi-level representation of the documents and user
queries while introducing a hybrid weighting schema to
identify the most significant concepts for creating the
semantic network.

• Proposing a hybrid and multi-layer method of semantic
similarity computation.

The paper is structured as follows: in the second section, 
the related works are explored. In the third section, the 
research objectives are declared, the hierarchical and 
taxonomic structures of the top-level ontology, Wikipedia 
and WordNet are examined and the proposed method of 
semantic information indexing and management is 
introduced. In the fourth section, the evaluation results are 
offered and finally, in the fifth section, the conclusion is 
presented. 


2. Related Works 

Three important, yet different, criteria determine what
kind of information indexing and management method can
be used in a text mining application:

• What kind of information model should be employed?

• Should we assume semantic relations between
concepts/words?

• Should we utilize structured KBs such as ontology?

The information models determine how the textual 
resources should be represented and how the similarity 
between representation models of documents should be 
measured, so that the most similar documents to user 
preferences are identified. The probabilistic models and the 
Vector Space Models (VSMs) are among the widely used 
information models [3]. For example, the language models 
[1,12] and the Bayesian network models [13,14] are 
considered among the probabilistic models. The vector space 
models [15] represent the textual resources in a vector form,
and the similarity between them is usually calculated using
the cosine similarity measure. As the majority of the
traditional information management models do not
disambiguate the concepts/words and use basic feature
extraction techniques, they are very easy to implement.
However, they exhibit relatively low precision and poor 
performance. In one study [16], the authors introduce a 
hybrid Sentence-Vector Space Model (S-VSM) and 
Unigram representation models for the text document. 
However, in recent years, numerous studies [1, 8, 17-21] 
have utilized the graph-based methods for the information 
indexing and management in which a domain/Top-Level 
ontology is often used to represent textual resources and their 
contextual semantics in the form of a graph. 


2.1 Learning-based Information Systems 

Intelligent learning models are also used in the field of text 
mining. The bag-of-concepts method was introduced [22] as 
an alternative document representation method. The 
proposed method creates concepts through clustering word 
vectors generated from word2vec and uses the frequencies 
of these concept clusters to represent document vectors. In 
another study [23], Metzler and Croft’s MRF (Markov
Random Field) model [24] is employed to construct the
information model, and a supervised learning method called
regression rank [25] is used to improve the performance of
the Markov information model. Also, the integration of machine
learning techniques and knowledge-based methods has 
proved to be quite beneficial. For example, in one study [26], 
a novel framework for incorporating KB into the neural 
network is introduced to produce a high quality 
representation of text. The most important shortcoming of 
learning-based methods is their reliance on domain data for 
training a classification model. As the proposed framework 
is reliant on multi-domain structured knowledge of ontology 
and KBs, it can achieve better performance and accuracy. 


2.2 Model-based Information Systems

As mentioned earlier, assuming semantic relations between
concepts/words determines what kind of information indexing
and management method can be used for a text mining
application [13,27,28]. The majority of the traditional
methods are based on the Bag of Words (BoW) models. The 
underlying assumption in these models is that a document 
can be represented by a set of unconnected concepts or
words (i.e., no relation is defined between concepts/words)
[15]. These information models usually need an additional 
term weighting schema; therefore, selecting a proper 
weighting schema has a profound effect on the accuracy and 
precision of the model. In one study [29], the importance of 
employing a suitable weighting schema for information 
retrieval-related applications is emphasized. However, since 
these models do not take into account the (semantic) 
relations between the concepts/words, unsatisfactory results 
are often obtained. 

To overcome the drawbacks of the BoW-based systems, 
term-dependence models are introduced. These models 
exploit the relations established between concepts/words. 
For example, in one study [30], a fuzzy-based method for 
considering the relation between index terms is introduced. 
Also, a number of studies [12,23,24] demonstrate
that the term-dependence models exhibit far better
performance than the bag-of-words models. However, the
biggest challenge of these models is the large amount of
training data needed to estimate the joint distribution of the
documents and queries. Also, limited domain information
might lead to unsatisfactory performance. Utilizing the 
structured knowledge of ontology and KBs can help the term 
dependency-based retrieval methods overcome this 
limitation. 

In the majority of the studies, using the WordNet
synonym sets is recommended to model the semantics (i.e.,
meaning) in the documents [31,32]. When coupled with an
efficient Word Sense Disambiguation (WSD) technique,
these systems exhibit good performance and accuracy. As
the proposed system in this paper integrates the structured
knowledge and semantics of ontology and KBs in every
component, it can overcome the shortcomings of
model-based information systems.


2.3 Information Systems with emphasis on information 
extraction techniques 

From another perspective, one can distinguish between the 
information indexing and management systems based on the 
information extraction module they use. To date, different
methods of extracting informative features have been introduced;
however, the main differences between them arise from two
main factors: (1) the structure of the auxiliary knowledge
sources employed to extract features, and (2) the details of the
extracted features. The natural language processing (NLP)-based
methods are usually domain-independent and are used
to extract semantic, syntactic and morphological features 
from documents and are usually computationally expensive 
[8,33-37]. However, in order to overcome the computational 
obstacles of these methods, the rule-based information 
extraction methods are introduced. These methods construct 
the extraction rules either manually or automatically. The 
automatic rule-based methods [8, 33, 38, 39, 40, 9] exhibit 
far better performance in domain-specific applications. 
A considerable number of manual rule-based methods have been
proposed for semantic annotation [41,42] and information
extraction [43-46]. In this regard, regular expressions are
used to extract features and information from textual
documents. Etzioni et al. [42] employed domain-independent
rules to find the information and features that
help the system identify the correct class of documents. The use
of ontology for information and feature extraction has also 
been investigated as such. In another study [41], a domain 
ontology is used to implement a semantic annotation and 
information extraction framework. 


2.4 Ontology-based Information Systems 

The ontology-based methods exploit the structured 
knowledge of ontology to implement a semantic framework 
for integrating knowledge in information indexing and 
management systems [48, 49, 11]. In one study [49], an 
ontology-based approach for integrating knowledge of 
domain ontologies in information extraction and retrieval 
systems is introduced. A detailed study of the information 
extraction, indexing and retrieval systems is presented in 
other studies [50, 51]. Meanwhile, utilizing ontology-based
methods for semantic information indexing and management
is another alternative for considering the term-dependency.
In such methods, the relations between concepts/words are
inferred using the graphical structure of the ontology. In the
next step, the relations between concepts/words are 
identified and employed to compute the 
similarity/relatedness between the documents and user 
preferences. The structured KBs such as ontologies, 
Wikipedia and WordNet, are widely used in information 
indexing and management applications [17-19]. 

In one study [18], a personalized method of textual 
document search and retrieval according to user profiles is 
introduced in which the documents are retrieved and ranked 
according to a graph-based distance measure. The relations 
between the concepts are extracted using a web-based 
ontology called ODP [52]. In another study [53], a 
knowledge-based recommender system based on the
integration of ontology and sequential pattern mining (SPM) 
for e-learning resource recommendation is introduced. The 
ontology is used for domain knowledge modelling and 
representation, and SPM is utilized for detecting the learners’ 
sequential learning patterns. 

Researchers [54] have also utilized domain ontology to
establish semantic relations between the concepts/words and
to construct the semantic networks. As such, the
relations between concepts are weighted according to a 
specific weighting schema, and then the documents are 
ranked and displayed according to their similarity to the user 
queries. 

The major problem with such systems is that they do not 
consider the synergy relationship between the different 
components of an information system. In this paper, the 
integration of structured knowledge and KBs in all 
components of the system is proposed to overcome this 
problem. 


2.5 Knowledge-based Information Systems 

Wikipedia is also used for text mining applications and
representation of the textual resources. One proposed method
[55] represents each document as a concept vector in
Wikipedia's semantic space to model the text semantics.
Then, several heuristic selection rules are defined to quickly
pick out related concepts from Wikipedia's semantic
space. Finally, the similarity between documents is computed
to classify the documents.

Also, the personalized retrieval and ranking methods have
gained interest in recent years. These methods facilitate the
rapid access and accurate retrieval of the textual documents 
[18,52,54,56]. The most similar/related documents to the 
user preference are identified based on the similarity of user 
preferences and document contents. The user preferences are 
easily obtained by analyzing the usage data and user’s 
previously accessed documents. 

Like ontology-based information systems, knowledge-based
systems do not consider the synergistic relationship
between the different components of an information system.
Therefore, in order to overcome this limitation, the 
integration of structured knowledge of ontology and KBs in 
every component of the proposed method is considered. 

The following table summarizes the related methods in 
indexing and information retrieval, their underlying model 
and their characteristics. 

Table 1. Related methods and their characteristics

Methods | Personalization | Ontology-based | Model | Features
--- | --- | --- | --- | ---
Kara et al. [27] | No | Yes | Graph-based, Keyword-based | Term-Dependency assumption, Scalable, Domain-Specific, ontology-based
Lasse [35] | No | No | Language Model | No Assumption of Term-Dependency, General Domain
Hahm et al. [23] | Yes | Yes | Graph-based, Keyword-based | Term-Dependency assumption, Domain-Specific, ontology-based
Metzler et al. [47] | No | No | Language Model | Term-Dependency assumption
Li et al. [38] | No | Yes | Graph-based | Term-Dependency assumption, ontology-based
Nefti et al. [27] | No | No | Fuzzy-based | Term-Dependency assumption
Daoud et al. [13] | Yes | Yes | Graph-based, Keyword-based | Term-Dependency assumption, General Domain, ontology-based
Li et al. [78] | No | No | Intelligent Learning model | Knowledge-based, conceptualized vector space
Proposed Method | Yes | Yes | Graph-based, Enriched Keyword-based, Language Model | Term-Dependency assumption, Scalable, General Domain, ontology-based, knowledge-based

3. Proposed Method


This section can be divided into three subsections: 1) 
research objectives, 2) the structures of ontology and KBs 
integrated into the proposed framework, 3) the specification 
and characteristics of the proposed information processing 
and management framework. 


3.1 Research Objectives 
The general objective of this paper is to develop a multi- 
purpose framework for collecting information from different 
knowledge sources and modelling the extracted semantic, 
lexical and syntactical features in a multi-level 
representation using the graph-like structure of semantic 
networks. In this regard, the specific objectives of this 
research are: 

1. To describe a multi-purpose text mining framework 
which integrates ontology and KBs for developing a 
multi-level representation of textual resources using 
machine-readable semantic networks. 

2. To describe a mechanism in which the information 
content of textual resources is enriched for better 
representation. 

3. To describe a hybrid multi-layer semantic similarity 
module for identifying resources that satisfy users’ 
information needs. 

4. To assess and analyze the performance and effectiveness 
of the proposed framework in semantic information 
indexing and management applications. 

5. To evaluate the effect of the enrichment module on the 
overall performance of the proposed method. 

6. To evaluate the effect of the representation module and 
semantic similarity mechanism and its components on 
the overall performance of the framework. 


3.2 The Structure of Ontology and KBs 

The ontology and KBs play a crucial role in identifying the
semantics and context. Therefore, familiarity with the
hierarchical and taxonomical structure of these KBs helps us
determine what kind of semantic structures can be identified
and extracted from textual resources.


3.2.1 OntoWordNet Top-Level Ontology

The OntoWordNet ontology (OWL alignment of the 
WordNet ontology with DOLCE-Lite Plus Ontology library) 
is an essential component of the proposed system. Every
concept of the ontology is organized as a synonym set, so that
the contextually similar (or equivalent) concepts can be
retrieved. This will facilitate the enrichment of the contents 
[57]. 


3.2.2 WordNet 

WordNet is an ontological lexicon for the English language. 
The purpose is to model a semantically enhanced lexicon for 
the English language. The main structure of WordNet
consists of synsets. Each synset groups a set of synonymous
concepts. More details about WordNet are available in the
literature [58]. 


3.2.3 Wikipedia 

The Wikipedia and BNC data used in this
research are available for academic use through the D.I.S.C.O
project. Both data sets are structured the same way. One study
[59] describes how the data are created.
Each data set consists of two parts: (1) a first-order
word vector, which contains words that occur together in
the Wikipedia and BNC corpora, and (2) a second-order word
vector, which contains words that occur in similar contexts.

3.3 The Proposed Information and Management Framework

The proposed method generates a semantic graphical 
representation (semantic network) of document contents and 
user profile and calculates the semantic similarity between 
them. The constructed user profile is used to personalize the 
information indexing and management process. Fig. 1 
illustrates the overview of the proposed system. The 
proposed system consists of two separate processes: (1) the 
semantic information indexing, and (2) the semantic 
information management. In terms of structure, the proposed
method consists of three major components: 1) semantic
network generation module, 2) content enrichment module,
and 3) semantic similarity/relatedness computation module.
As shown in Fig. 1, the documents and user queries act as the
system input. Several pre-processing operations are
performed on the inputs, and all the concepts undergo the word
sense disambiguation (WSD) process. First, assuming none of the
documents in the repository is indexed (by semantic
networks), every document in the repository is indexed by
its keywords. These simple indexes are then stored in a
database or repository. In the next step, a Boolean matching
model (known for its rapid and accurate pattern matching)
[61, 62] is built. As soon as a query is made by a user, it is
converted into a Boolean search expression, and the set of
documents in the repository that fully or partly match
the Boolean expression is identified and extracted. Every
retrieved document is then represented by a semantic network.
The proposed hybrid semantic similarity module is used to
determine which of the retrieved documents are the most
similar to the user query.
Figure 1. Overview of Semantic Indexing and Retrieval System
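
The keyword-based pre-filtering step can be illustrated with a minimal sketch. It assumes an in-memory inverted index and a flat list of query terms; the names `build_keyword_index` and `boolean_match` are hypothetical and are not taken from the proposed system, and the Boolean model of [61, 62] may differ in detail.

```python
from collections import defaultdict


def build_keyword_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index


def boolean_match(index, query_terms, require_all=False):
    """Return the ids of documents that match the query terms.

    require_all=True  -> AND of all terms (full match only).
    require_all=False -> OR of all terms (full or partial match).
    """
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    if require_all:
        return set.intersection(*postings)
    return set.union(*postings)


# Toy repository: document id -> keyword set (illustrative data only).
docs = {
    "d1": {"ontology", "semantic", "network"},
    "d2": {"semantic", "indexing"},
    "d3": {"football", "league"},
}
index = build_keyword_index(docs)
candidates = boolean_match(index, ["semantic", "ontology"])
full_matches = boolean_match(index, ["semantic", "ontology"], require_all=True)
```

In this sketch only the documents in `candidates` would go on to be represented as semantic networks and scored by the similarity module.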


Figure 2 depicts the document index matching and 
retrieval process. It should be noted that all the document 
semantic networks can be constructed offline. Also, the 
process of updating user profile semantic network using the 
previously accessed documents can be performed offline 
regularly and according to a pre-specified schedule. On the 
other hand, the process of calculating the semantic similarity 
between document semantic networks and user profiles is 
performed online and imposes negligible operational burden 
on the system. Also, the constructed semantic networks are 
stored in a database called index repository. The user 
profile’s semantic network is stored in a repository to 
facilitate the regular updating process. In the following sections,
the details of the proposed method are discussed.
Figure 2. Document Index Matching and Retrieval


3.3.1 The semantic document pre-processing 

The following pre-processing techniques are performed on
the contents of the documents: (1) Stop-Word Removal, (2)
Uni-gram and Bi-gram processing of English Words, (3)
Stemming of concepts/words [63,64], (4) Part of Speech
Tagging [65], (5) Lemmatization of the concepts/words [65,
66], (6) Named-Entity Recognition [67,68], (7) Bi-gram
Authentication. The authenticity of Bi-gram concepts is
validated using the Wikipedia KB by looking up the
frequency of each Bi-gram in Wikipedia. Once
the authentication operation is over, the rejected Bi-grams
are removed.
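
A minimal sketch of steps (1), (2) and (7) is given below. It assumes that bi-grams are formed after stop-word removal and that Wikipedia frequencies are available as a plain dictionary; the stop-word list, the `kb_frequency` table and the `min_count` threshold are illustrative assumptions, and stemming, tagging, lemmatization and named-entity recognition are omitted.

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "to"}


def extract_candidates(text):
    """Return (unigrams, bigrams) after lower-casing and stop-word removal."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOP_WORDS]
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return tokens, bigrams


def authenticate_bigrams(bigrams, kb_frequency, min_count=1):
    """Keep only bi-grams whose frequency in the KB reaches min_count."""
    return [b for b in bigrams if kb_frequency.get(b, 0) >= min_count]


# Hypothetical Wikipedia bi-gram frequencies.
kb_frequency = {"semantic network": 120, "news story": 45}

unigrams, bigrams = extract_candidates("The semantic network of a news story")
accepted = authenticate_bigrams(bigrams, kb_frequency)
```

Here the candidate "network news" is rejected because it has no entry in the hypothetical frequency table.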

The output of this module is the document/user profile
vector. In the next step, all the detected Uni-gram and
Bi-gram concepts are weighted using the CF-IDF weighting
method (a variant of TF-IDF) [69]. Accordingly, the weight
of the concept $c_i$ in document $d_j$ is calculated using the
following equation:

$$\mathrm{weight}(c_i, d_j) = cf_{d_j}(c_i) \cdot \ln\left(\frac{N}{df}\right) \qquad (1)$$

where $N$ is the number of documents in the repository and $df$
(document frequency) is the number of documents in
which the concept $c_i$ appears. Also, the local frequency of a
concept $c_i$ that consists of $n$ words ($n \geq 1$) depends
on the number of occurrences of $c_i$ and all its
sub-concepts. Therefore, the concept frequency is formally
written as follows:

$$cf_{d_j}(c_i) = count_{d_j}(c_i) + \sum_{sc \in Sub\_Concepts(c_i)} \frac{Length(sc)}{Length(c_i)} \cdot count_{d_j}(sc) \qquad (2)$$

In this equation, $Length(c_i)$ represents the number of
words in concept $c_i$ and $Sub\_Concepts(c_i)$ denotes all the
possible sub-concepts which are directly derived from $c_i$. Let
$D = \{d_1, d_2, \ldots, d_n\}$ be the set of documents and
$d_i = \{t_i^1, t_i^2, \ldots, t_i^k\}$ be the document vector for
document $d_i$; then $w_i = \{w_i^1, w_i^2, \ldots, w_i^k\}$ is the
document weight vector after the weighting method is applied
to all the documents.
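
The CF-IDF weighting of equations (1) and (2) can be sketched as follows. The sketch assumes that the sub-concepts of a multi-word concept are its shorter contiguous word sequences, which is one possible reading of "directly derived"; the function names and the toy counts are hypothetical.

```python
import math


def sub_concepts(concept):
    """Shorter contiguous word sequences of a concept (assumed definition)."""
    words = concept.split()
    n = len(words)
    return [" ".join(words[i:j])
            for i in range(n)
            for j in range(i + 1, n + 1)
            if j - i < n]


def concept_frequency(concept, counts):
    """Equation (2): own count plus length-weighted counts of sub-concepts."""
    length = len(concept.split())
    cf = counts.get(concept, 0)
    for sc in sub_concepts(concept):
        cf += len(sc.split()) / length * counts.get(sc, 0)
    return cf


def cf_idf(concept, counts, n_docs, doc_freq):
    """Equation (1): concept frequency times ln(N / df)."""
    return concept_frequency(concept, counts) * math.log(n_docs / doc_freq)


# Occurrence counts of concepts in one document (illustrative data only).
counts = {"news story": 2, "news": 3, "story": 1}
cf = concept_frequency("news story", counts)  # 2 + 0.5 * 3 + 0.5 * 1 = 4.0
weight = cf_idf("news story", counts, n_docs=100, doc_freq=10)
```

With these toy counts the bi-gram receives credit for the occurrences of its constituent words, each scaled by the ratio of lengths.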


3.3.2 The Word Sense Disambiguation of Concepts 
To handle the word ambiguity issue in text documents, the 
method of word sense discrimination [70] was introduced. In 
this method, the underlying assumption is that similar senses 
occur in similar contexts. Hence, by using a semantic 
similarity method to compare all possible senses of a concept 
with the context it appears in, we can predict the correct 
sense of a given concept in the document. To this end, the 
following procedures are performed: 

• A ±7 window around the concept in the respective
document is created. Also, the first-order word
vector for each member of the window is retrieved
and appended to it. The window and the appended
vectors together form the context vector for the concept.

• The senses of the corresponding concept are retrieved from
WordNet, and for each sense an example of its usage in a
sentence and its brief definition (called a gloss) are
extracted. The Wikipedia-based first-order word vector for
each sense is also retrieved. The collected information about
a sense is aggregated to form the sense vector. As the
first-order vector contains the words that co-occur in similar
contexts, the similarity between the sense vector of
each sense and the context vector determines the true
contextual meaning of the corresponding concept.

• A combination of cosine [71] and Jaro-Winkler [71]
measures is used to calculate the similarity score as
follows:

$$Sim(Sense_{vector}, Context_{vector}) = \frac{1}{2}\left(Cosine_{sim} + JaroWinkler_{sim}\right) \qquad (3)$$

• The sense vector with the highest similarity score is selected
as the correct sense of the concept. The correct sense
number is used to annotate the concept.
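
The sense selection procedure above can be sketched as follows. The text does not state how the Jaro-Winkler measure is applied to whole vectors, so this sketch assumes the average best word-to-word Jaro-Winkler match; that aggregation, the function names and the toy vectors are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_sim(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i, ch in enumerate(s1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by a common prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def jaro_winkler_sim(a, b):
    """Assumed aggregation: mean best Jaro-Winkler match of each word in a."""
    if not a or not b:
        return 0.0
    return sum(max(jaro_winkler(x, y) for y in b) for x in a) / len(a)


def similarity(sense_vec, context_vec):
    """Equation (3): average of the cosine and Jaro-Winkler scores."""
    return 0.5 * (cosine_sim(sense_vec, context_vec)
                  + jaro_winkler_sim(sense_vec, context_vec))


def disambiguate(context_vec, sense_vectors):
    """Return the sense id whose vector is most similar to the context."""
    return max(sense_vectors,
               key=lambda s: similarity(sense_vectors[s], context_vec))


# Illustrative context and sense vectors for the word "bank".
context = ["river", "water", "shore"]
senses = {
    "bank#1": ["money", "deposit", "loan"],
    "bank#2": ["river", "slope", "water"],
}
best = disambiguate(context, senses)
```

In a full implementation the context and sense vectors would be built from the ±7 window, the WordNet glosses and examples, and the first-order word vectors described above.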


3.3.3 The Enrichment process of extracted contents from 
documents 

In most cases, the extracted features are not a good 
representation of the document context. The main purpose of 
this section is to identify concepts and semantic structures 
that can better describe the document context. 


3.3.3.1 Enrichment using Wikipedia KB 

External knowledge sources such as Wikipedia can be used 
to improve the representation of textual resources [72]. As 
mentioned earlier, the Wikipedia KB contains a set of
information called the second-order word vector. This vector not
only contains the co-occurring words in Wikipedia but also
words that are contextually similar and interchangeable in
different contexts. This paper proposes employing this vector
to enrich textual resources.
These vectors are searched to identify the co-occurring and
contextually similar concepts to a given concept/word. The
identified concepts are then weighted according to the
following equation and are appended to the document
vectors/user profiles.

$$Weight_{Related\ Concept}(\text{second-order vector}) = Weight_{Original\ Concept} \times 0.8 \qquad (4)$$

Because the new concepts are obtained indirectly and are
inferred using Wikipedia, their assigned weight is lower than
that of the original concepts. The weighting parameter is estimated
using a subset of evaluation data.
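
The enrichment rule of equation (4) can be sketched as follows. The sketch assumes the second-order word vectors are available as a dictionary from a concept to its related concepts, that concepts already present in the document keep their original weight, and that a related concept reached from several originals keeps the largest derived weight; these choices and the name `enrich` are assumptions, not details given in the text.

```python
RELATED_WEIGHT_FACTOR = 0.8  # damping factor from equation (4)


def enrich(doc_weights, related_lookup, factor=RELATED_WEIGHT_FACTOR):
    """Append related concepts with a damped weight to a document vector."""
    enriched = dict(doc_weights)
    for concept, weight in doc_weights.items():
        for related in related_lookup.get(concept, []):
            if related in doc_weights:
                continue  # assumption: original concepts are left unchanged
            derived = weight * factor
            # Assumption: keep the largest derived weight for a new concept.
            if derived > enriched.get(related, 0.0):
                enriched[related] = derived
    return enriched


# Illustrative document vector and hypothetical second-order neighbours.
doc_weights = {"car": 2.0, "engine": 1.0}
related_lookup = {"car": ["automobile", "engine"], "engine": ["motor"]}
enriched = enrich(doc_weights, related_lookup)
```

The same function could serve the ontology-based enrichment step, since it applies the same factor to superclass and equivalent concepts.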


3.3.3.2 Enrichment using OntoWordNet Ontology

The OntoWordNet ontology classes are organized in the
form of sequences. Every sequence defines a set of synonymous
concepts (the synonyms are separated by two
consecutive underscores “__”, and the words of a multi-word
concept are joined by a single underscore “_”).
The concept map consists of a concept and a set of related
ontology classes. The links between the concept and related
ontology classes are the equivalent property and the
subclass/superclass relations. The equivalent property is
transitive and reversible. The aforementioned procedure
results in the creation of concept maps for each concept.
Also, the concept maps play a vital role in constructing a
multi-level representation of documents. It should be noted
that the obtained concept maps are represented by a
sub-ontology using OWL/XML schema to facilitate the process
of annotating semantic networks with concept maps. An
example of a concept map for the concept of news story is
illustrated in Fig. 3.


Figure 3. Representation of a generated concept map (labels recoverable from the figure: Top-Level, Subclass, Equivalent, Represents; nodes: news_story and news_article__news_story__newspaper_article)
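
The class naming convention described above can be decoded with a short sketch. It assumes class names follow exactly the double-underscore and single-underscore pattern shown in Fig. 3; the helper names are hypothetical.

```python
def parse_synonym_class(class_name):
    """Split an OntoWordNet-style class name into its synonymous concepts."""
    return [part.replace("_", " ") for part in class_name.split("__") if part]


def equivalent_concepts(concept, class_names):
    """Collect synonyms of a concept from every class that contains it."""
    found = []
    for name in class_names:
        synonyms = parse_synonym_class(name)
        if concept in synonyms:
            found.extend(s for s in synonyms if s != concept and s not in found)
    return found


synonyms = parse_synonym_class("news_article__news_story__newspaper_article")
# → ["news article", "news story", "newspaper article"]
equivalents = equivalent_concepts(
    "news story", ["news_article__news_story__newspaper_article"]
)
```

The equivalent concepts returned this way are the candidates that the enrichment step appends to the document vector.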


The superclass and equivalent concepts are weighted 
according to the following equation and are appended to the 
document vectors/user profiles. Since the new concepts are 
obtained indirectly and are inferred using top-level ontology 
structure, their assigned weight is lower than the original 
concepts. The weighting parameter is estimated using a 
subset of evaluation data. 


$$\mathrm{Weight}_{\text{Related Concept (ontology)}} = \mathrm{Weight}_{\text{Initial Concept}} \times 0.8 \tag{5}$$
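
A minimal sketch of how the discounted concepts could be appended to a document vector or user profile is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `append_related_concepts`, the dictionary layout and the rule that a concept keeps its larger weight when it is already present are all hypothetical; only the factor 0.8 comes from Eqs. (4) and (5).

```python
RELATED_WEIGHT_FACTOR = 0.8  # discount applied to inferred concepts in Eqs. (4) and (5)


def append_related_concepts(vector, related, factor=RELATED_WEIGHT_FACTOR):
    """Return a copy of `vector` extended with discounted related concepts.

    vector:  dict mapping concept -> weight (document vector or user profile)
    related: dict mapping an original concept -> list of inferred concepts
    """
    extended = dict(vector)
    for original, inferred in related.items():
        if original not in vector:
            continue  # nothing to inherit a weight from
        discounted = vector[original] * factor
        for concept in inferred:
            # Assumption: a concept that is already present keeps its larger weight.
            extended[concept] = max(extended.get(concept, 0.0), discounted)
    return extended


doc_vector = {"news_story": 0.5, "football": 0.25}
related = {"news_story": ["news_article", "message"]}
extended = append_related_concepts(doc_vector, related)
print(extended)
```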
It should be noted that the semantic, graph-like structure of the concept
maps is used in the semantic network generation phase to annotate the
documents' semantic networks and to infer new links between concepts.

Finally, let $ec_i = \{t_{p+1}^{i}, t_{p+2}^{i}, \ldots, t_{p+q}^{i}\}$ be
the set of words/concepts identified for document $d_i$ during the
enrichment stage; then
$d_i = \{t_1^{i}, t_2^{i}, \ldots, t_p^{i}, t_{p+1}^{i}, \ldots, t_{p+q}^{i}\}$
is the extended document vector obtained after appending $ec_i$ to the
original document vector $d_i$.


3.3.4 The semantic network generation 

In order to generate the semantic networks, a robust semantic weighting
method must be used to identify and select the most informative concepts in
documents. The CF-IDF measure does not possess the semantic characteristics
required for this purpose. In this paper, a new weighting schema called the
salient score is introduced to select the most prominent concepts, which
then participate in the semantic network generation process. In the next
step, four important relations in the ontology (namely synonymy,
superclass, subclass and Part_of) are established between the concepts to
link them to each other. The connected concepts are organized in a
graph-like structure to form the semantic network. The enriched contents
also play an important role in identifying the concepts that reflect the
information content of documents. Therefore, the generated semantic network
acts as a thorough and comprehensive abstract of the document. Accordingly,
the semantic network generation process consists of two important parts:
(1) calculating the salient score of the concepts in the documents, and
(2) connecting the concepts using ontology-defined relations. Figure 4
illustrates the process of generating semantic networks. The elements of a
semantic network are discussed first.


3.3.4.1 The elements of a semantic network 
The semantic network consists of a set of concepts and the relations
connecting them. In this paper, two given concepts are connected to each
other through one of four relations: synonymy, superclass, subclass and
Part_of.
Definition of Concept: A concept refers to a significant entity in the
document.
Definition of Superclass/Subclass relation: Assuming that two concepts
$x_i$ and $x_j$ are given, if concept $x_i$ categorizes the concept $x_j$,
then $x_i$ is called the superclass of $x_j$, and $x_j$ is called the
subclass of $x_i$.
Definition of Synonymy relation: Assuming $x_i$ is a concept in the
document, if a concept $x_j$ can be found that replaces $x_i$ without
changing the informational content of the document, then $x_i$ and $x_j$
are connected by the Synonymy relation.
Definition of Part_of relation: The Part_of relation represents the
part-whole relationship between the concepts. The Part_of relation is
established between concepts $x_i$ and $x_j$ if the presence of $x_j$
implies the existence of $x_i$, whereas the presence of $x_i$ does not
indicate the presence of $x_j$. The Part_of relation is obtained by
aligning the DBpedia ontology [73] and its related NLP datasets with the
OntoWordNet ontology.
These relations can be represented in the form of an ordered triplet
[Subject, Object, Relation]. For example:

• Superclass relation: [Sports, Football, Superclass],
• Subclass relation: [Football, Sports, Subclass],
• Synonymy relation: [Sports, Athletics, Synonym],
• Part_of relation: [Halfback, Football, Part_of],
• Part_of relation: [Halfback, Sports, Part_of],
• Subclass relation: [Halfback, position, Subclass].
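
The ordered triplets listed above map naturally onto a small record type. The sketch below is illustrative only; the `Triplet` class, the `INVERSE` table and the `inverse` helper are hypothetical names. It relies on the examples above, in which [Sports, Football, Superclass] and [Football, Sports, Subclass] describe the same link, and it treats synonymy as symmetric; no inverse is assumed for Part_of.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    subject: str
    obj: str
    relation: str


# Assumption: superclass and subclass mirror each other; synonymy is symmetric.
INVERSE = {"Superclass": "Subclass", "Subclass": "Superclass", "Synonym": "Synonym"}


def inverse(triplet: Triplet) -> Optional[Triplet]:
    """Return the mirrored triplet, or None when no inverse is defined (e.g. Part_of)."""
    relation = INVERSE.get(triplet.relation)
    if relation is None:
        return None
    return Triplet(triplet.obj, triplet.subject, relation)


print(inverse(Triplet("Sports", "Football", "Superclass")))
print(inverse(Triplet("Halfback", "Football", "Part_of")))
```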
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Figure 4. Semantic Network Construction Process 


3.3.4.2 The salient score 

Not all the concepts/words contribute to the information content of a
document. Some only serve linguistic or formal purposes and carry little
meaning on their own. Therefore, it is better to discard the trivial
concepts/words that contribute insignificantly to the context.

In this paper, in order to identify the most informative concepts/words in
the document, the salient score is introduced, which combines three
different criteria: (1) a structural criterion, (2) a CF-IDF criterion, and
(3) a semantic criterion. In other words, the proposed weighting schema is
a hybrid schema that integrates the term-based, structure-based and
knowledge-based weighting approaches.
The Structural Criterion: Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of
retrieved documents for a given query, $n$ be the number of the retrieved
documents and $d_j = \{t_1^j, t_2^j, \ldots, t_m^j\}$ be the extended
document vector of $d_j$. Also, $sub(t_i^j)$, $sup(t_i^j)$,
$partof(t_i^j)$ and $synonym(t_i^j)$ are the sets of concepts/words in
$d_j$ that are connected to the concept $t_i^j$ via subclass, superclass,
Part_of and synonym relations, respectively. The structural criterion score
for $t_i^j$ is calculated by $Str_{score}(t_i^j)$ as follows:

$$Str_{score}(t_i^j) = \begin{cases} \dfrac{|sub(t_i^j)| + |sup(t_i^j)| + |partof(t_i^j)| + |synonym(t_i^j)|}{\max\limits_{t_k^j \in d_j}\left(|sub(t_k^j)| + |sup(t_k^j)| + |partof(t_k^j)| + |synonym(t_k^j)|\right)}, & t_i^j \text{ is a uni/bigram in the document} \\ 0, & \text{o.w.} \end{cases} \tag{6}$$

This criterion assumes that for each concept/word in the document there is
at least one concept in the document that is connected to $t_i^j$ via
subclass/superclass/Part_of/synonym relations. As the number of
subclass/superclass/Part_of/synonym concepts for $t_i^j$ in the document
increases, a higher structural score is yielded. If no
subclass/superclass/Part_of/synonym concept for $t_i^j$ is found in the
document, the structural criterion score for $t_i^j$ is zero.
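
The structural criterion can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `structural_scores` and the input layout (one dictionary of relation sets per concept) are hypothetical, and the score is taken as the number of related concepts divided by the largest such number in the same document, as in Eq. (6).

```python
RELATION_KINDS = ("sub", "sup", "partof", "synonym")


def structural_scores(relations):
    """Compute the structural criterion for every concept of one document.

    relations: dict mapping concept -> dict whose optional keys 'sub', 'sup',
               'partof' and 'synonym' hold the sets of concepts of the same
               document linked to it by that relation.
    """
    counts = {
        concept: sum(len(linked.get(kind, ())) for kind in RELATION_KINDS)
        for concept, linked in relations.items()
    }
    largest = max(counts.values(), default=0)
    if largest == 0:
        # No concept has any related concept: every score is zero.
        return {concept: 0.0 for concept in counts}
    return {concept: count / largest for concept, count in counts.items()}


relations = {
    "sports": {"sub": {"football"}, "synonym": {"athletics"}},
    "football": {"sup": {"sports"}},
    "weather": {},
}
scores = structural_scores(relations)
print(scores)  # {'sports': 1.0, 'football': 0.5, 'weather': 0.0}
```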

The CF-IDF Criterion: The CF-IDF criterion and the way the concepts in the
document are weighted are fully discussed in Section 5.1.

The Semantic Criterion: This criterion is calculated by the following
equation and is denoted by $Sem_{score}(t_i^j)$:

$$Sem_{score}(t_i^j) = \frac{\left|\{t_k^j \mid Sem_{Distance}(t_i^j, t_k^j) < \alpha\}\right| + \left|\{t_k^j \mid IC(t_k^j) < \beta\}\right|}{2 \times |d_j|}, \quad 1 \le k \le |d_j| \tag{7}$$

Where $t_i^j$ is the underlying concept in the document, $t_k^j$ represents
the elements of the document vector and $Sem_{Distance}(t_i^j, t_k^j)$ is
the semantic distance, which refers to the minimum number of nodes between
the two concepts $t_i^j$ and $t_k^j$ in the hierarchical structure of the
KBs. $IC(t_k^j)$ is the information content of $t_k^j$, computed using
WordNet and the Penn Treebank. More information is available in other
studies [69, 37]. The parameters $\alpha$ and $\beta$ are thresholding
parameters for the semantic distance and the information content. The
equation expresses the idea that when the context of a given document is
about a particular information domain, the concepts/words in the document
are very similar in terms of the context, and they form a cluster in the
ontology/KBs. In other words, the concepts/words with a high semantic
distance from the context are considered insignificant and receive a lower
semantic score.

In order to calculate the salient score, a linear and weighted combination
of these criteria is computed as follows:

$$Sal_{score}(t_i^j) = w_1 \cdot Str_{score}(t_i^j) + w_2 \cdot cf\text{-}idf(t_i^j) + w_3 \cdot Sem_{score}(t_i^j) \tag{8}$$

In this equation, $w_1$, $w_2$ and $w_3$ are the weighting parameters of
the calculated scores; each lies in (0, 1) and their sum is equal to 1.
These parameters are estimated using a subset of the evaluation data in the
evaluation stage. It should be noted that the salient score is computed
only for the concepts in documents. All the concepts in the user profile
participate in the user profile semantic network generation. In the next
step, the top-n% of the concepts/words with the highest salient score are
selected to generate the document semantic network.
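
As a rough illustration of Eq. (8) and of the top-n% selection, the sketch below combines three per-concept score dictionaries and keeps the highest-ranked share. The function names, the placeholder weights (0.3, 0.4, 0.3) and the rounding rule for n% are assumptions made for the example; the paper estimates the weights from evaluation data, and the CF-IDF values are assumed here to be already normalised to [0, 1].

```python
import math


def salient_scores(str_score, cf_idf, sem_score, weights=(0.3, 0.4, 0.3)):
    """Linear combination of the three criteria for every concept in `cf_idf`."""
    w1, w2, w3 = weights
    if not math.isclose(w1 + w2 + w3, 1.0):
        raise ValueError("weights must sum to 1")
    return {
        concept: w1 * str_score.get(concept, 0.0)
        + w2 * cf_idf.get(concept, 0.0)
        + w3 * sem_score.get(concept, 0.0)
        for concept in cf_idf
    }


def top_percent(scores, percent):
    """Return the top `percent`% of concepts by score (at least one concept)."""
    if not scores:
        return []
    keep = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


str_score = {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.0}
cf_idf = {"a": 0.2, "b": 0.9, "c": 0.1, "d": 0.4}
sem_score = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 0.0}

scores = salient_scores(str_score, cf_idf, sem_score)
selected = top_percent(scores, 50)
print(selected)  # ['b', 'a']
```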
3.3.5 The document semantic network generation 
When the top-n% concepts/words are projected onto the OntoWordNet ontology,
a number of separated clusters of concepts/words are formed in the ontology
because some of the concepts/structures that can connect the separated
clusters are left out. One of the main objectives of the proposed method is
to identify these connective concepts/structures, so that a comprehensive,
thorough and connected representation of the documents is formed. In order
to generate the document semantic network, the proposed algorithm puts
together the identified concepts/structures one by one, connects them using
the aforementioned relations and then forms a connected graph.
The concepts/structures essential for generating a fully connected semantic
network and connecting the separated concept clusters are mostly identified
during the content enrichment stage. These connecting concepts/structures
are called Liaison features. This property of the proposed algorithm
contributes heavily to the novelty of the proposed method. The proposed
algorithm for generating semantic networks is illustrated in Figure 5.
Also, Figure 6 depicts how the semantic network is formed and how the
Liaison features connect the separated concept clusters.

The resulting semantic networks will be represented as a 
sub-ontology using the OWL/XML schema. Such a 
representation not only makes the semantic networks 
machine-readable but it also enables us to merge them with 
the generated concept maps. 

Input: set of documents D = {D1, D2, ..., Dn}; set of prominent concepts in each document D' = {t1, t2, ..., tn}
• Loop: for each document in D
    • Loop: until D' is empty
        • Condition: if the semantic network is empty
            • Append the first concept to the semantic network.
            • Delete the first concept from D'
        • End of Condition
        • Min_Node = the minimum number of nodes between concepts in the hierarchical structure of the ontology and KB
        • Loop: for each ti that already exists in the semantic network
            • Loop: for each tj in D'
                • Condition: if the distance between ti and tj is less than Min_Node
                    • Source = ti; Destination = tj
                    • Min_Node = the distance between ti and tj
                • End of Condition
            • End of Loop
        • End of Loop
        • Add "Destination" to the semantic network and remove "Destination" from D'
        • Condition: if Min_Node is equal to 1
            • Connect ti and tj via the superclass/subclass relation
        • Condition: if Min_Node is greater than 1
            • For each edge between ti and tj
                • Add the endpoint concept of the respective edge to the semantic network
        • End of Condition
        • End of Condition
    • End of Loop
• End of Loop
Output: the generated semantic network for D'


Figure 5. The pseudo-code for the creation of the semantic network
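
The greedy procedure of Figure 5 can be approximated by the Python sketch below. It is a simplified reading, not the authors' code: the ontology is assumed to be an undirected adjacency dictionary, distances are counted in edges by breadth-first search, relation labels (superclass/subclass and so on) are omitted, and the names `shortest_path` and `build_semantic_network` are hypothetical. The toy ontology is invented for the example. Intermediate nodes on a connecting path play the role of Liaison concepts.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list from source to target, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def build_semantic_network(ontology, prominent):
    """Greedily attach the closest remaining concept along its shortest ontology path."""
    remaining = list(prominent)
    nodes, edges = set(), set()
    if not remaining:
        return nodes, edges
    nodes.add(remaining.pop(0))
    while remaining:
        best = None
        for t_i in sorted(nodes):          # sorted only to keep the result deterministic
            for t_j in remaining:
                path = shortest_path(ontology, t_i, t_j)
                if path and (best is None or len(path) < len(best)):
                    best = path
        if best is None:                   # nothing left is reachable in the ontology
            nodes.update(remaining)
            break
        remaining.remove(best[-1])
        nodes.update(best)                 # inner nodes of the path are Liaison concepts
        edges.update(frozenset(pair) for pair in zip(best, best[1:]))
    return nodes, edges


toy_ontology = {
    "info": ["message", "data"],
    "message": ["info", "story"],
    "story": ["message", "news_story"],
    "news_story": ["story"],
    "data": ["info", "database"],
    "database": ["data"],
}
nodes, edges = build_semantic_network(toy_ontology, ["news_story", "story", "database"])
print(sorted(nodes))
```

In this toy run, "message", "info" and "data" were not among the prominent concepts but are pulled in to connect "story" with "database".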


As shown in Figure 6, after projecting concepts/words onto the OntoWordNet
ontology, two separated clusters of concepts are formed in the ontology. By
analyzing the ontology, it can be understood that the concept “info
information” is the Liaison concept for connecting the two separated
clusters. By enriching the content of the documents using the
ontology-based and Wikipedia-based approaches, the concept “info
information” is appended to the document semantic network and the
connection between the two separated clusters is established. Also, the
concepts “message”, “story”, “television news” and “newscast” act as the
Liaison concepts for connecting the already constructed document semantic
network with concepts in the higher/deeper hierarchical structure of the
ontology.


Figure 6. An example of Semantic Network


To provide a personalized experience, the user queries and the user's
previously accessed documents are analyzed to generate the user profile
semantic network. Using the proposed algorithm, the user queries and all
the previously accessed documents are converted into semantic networks. The
resulting semantic networks are then combined to create a connected graph.
The graph represents a portion of the ontology that covers the
informational preferences and priorities of the user. The process of
personalized document identification is illustrated in Figure 7.


3.3.6 The semantic similarity computation modules
In the proposed semantic information indexing and management method, the
semantic similarity between a document semantic network and a user profile
is computed based on three types of similarity measures. The first two
types compute the similarity based on the established relations between
concepts, while the last one computes the semantic similarity based on the
commonalities in semantic features. The proposed semantic similarity
measure relies heavily on the structured knowledge of the ontology,
Wikipedia and WordNet.

Figure 7. The process of personalized document retrieval using user profiles


3.3.6.1 The semantic relation robustness measure 

This measure determines the robustness of the established relations between
concepts. When multiple relations have the same ‘subject’, this measure
determines which relation is more effective in describing the ‘subject’. It
should be noted that the ‘subject’ is the informative element in a relation
between concepts. Assuming $d_i$ and $rel_i$ represent the document vector
and the set of relations between its concepts in a semantic network,
respectively, a generated semantic network is denoted by
$SN(d_i) = [(d_i), rel_i]$. Also, if $(t_j, rel, t_k)$ represents a
semantic relation between a subject $t_j$ and an object $t_k$ in document
$d_i$, the set $rel_i$ can be written as
$rel_i = \{(t_j, rel, t_k) \mid t_j, t_k \in (d_i)\}$. In this case, the
discriminatory power of a relation such as $(t_j, rel, t_k)$ is obtained
using the following equation:


$$Score_{discriminity}(SN(d_i), UP) = \frac{\sum_{\text{all the triplets}} Score_{discriminity}((t_j, rel, t_k))}{\text{number of triplets in the semantic net}}$$


SCOT Cgiscriminity (4, rel, t)) = 


2* (SRaoc(tj) + 1) * (SRup(t;) + 1) 
2 + (SORagoc( tj, te) + SORvp(t), tk) 
10 


10 — in ) 


1- 


SR = Subject-relation, SOR = Subject-object-relation,
UP = User Profile, doc = document
(9)


Where $t_j$ and $t_k$ represent the subject and object of a relation in the
semantic networks, $SR_{doc}(t_j)$ and $SR_{UP}(t_j)$ represent the number
of relations with $t_j$ as the subject, and $SOR_{doc}(t_j, t_k)$ and
$SOR_{UP}(t_j, t_k)$ represent the number of relations with $t_j$ as the
subject and $t_k$ as the object, in documents and queries, respectively.
Therefore, a relation such as $(t_j, rel, \sim)$ with high values of SR and
a low value of SOR is more robust in describing the subject $t_j$. It can
be concluded that this relation is a descriptive relation for the subject
$t_j$. The documents that share the highest number of descriptive relations
with the user query have the highest similarity score.


3.3.6.2 The semantic relation effectiveness measures 
These metrics measure how effectively the semantic network of a document
covers the information content of the user profile semantic network.

The explicit measure calculates the amount of information content shared
between the relations occurring in the documents and the user preferences,
and hence how similar the semantic network of a document is to a user
profile. This measure is computed as follows:

$$Score_{Explicit}\Big(\bigcup (t_j, rel, t_k) \in SN(d_i)\Big) = \frac{\sum_{\text{all the triplets}} Score\_term_{explicit}(t_j, rel, t_k)}{\text{number of triplets in the semantic net}}$$

$$Score\_term_{explicit}(t_j, rel, t_k) = \begin{cases} \delta_{exp}, & t_j, t_k \text{ are in the user profile} \\ 1 - \delta_{exp}, & \text{o.w.} \end{cases} \tag{10}$$

In this equation, $\delta_{exp}$ is a threshold in [0, 1]. When the subject
$t_j$ and the object $t_k$ of a relation in the document semantic network
appear in the user profile, a high similarity score is assigned.
The implicit measure evaluates how much the document semantic network
resembles the semantic network representation of the user preferences:

$$Score_{implicit}\Big(\bigcup (t_j, rel, t_k) \in SN(d_i)\Big) = \frac{\sum_{\text{all the triplets}} Score\_relation_{implicit}(t_j, rel, t_k)}{\text{number of triplets in the semantic net}}$$

$$Score\_relation_{implicit}((t_j, rel, t_k)) = \begin{cases} \delta_{imp}, & (t_j, rel, t_k) \text{ is in the user profile} \\ 1 - \delta_{imp}, & \text{o.w.} \end{cases} \tag{11}$$

Where $\delta_{imp}$ is a threshold in [0, 1]. When a relation with subject
$t_j$ and object $t_k$ appears in the user profile, a high similarity score
is assigned.
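
The explicit and implicit measures of Eqs. (10) and (11) can be sketched as two short Python functions. The sketch is hedged: the function names, the tuple layout (subject, relation, object) and the default threshold value 0.8 are assumptions for illustration; the thresholds of the two equations are called `threshold_exp` and `threshold_imp` here.

```python
def explicit_score(doc_triplets, profile_concepts, threshold_exp=0.8):
    """Average per-triplet score: high when subject and object both occur in the profile."""
    if not doc_triplets:
        return 0.0
    total = sum(
        threshold_exp
        if subject in profile_concepts and obj in profile_concepts
        else 1 - threshold_exp
        for subject, _relation, obj in doc_triplets
    )
    return total / len(doc_triplets)


def implicit_score(doc_triplets, profile_triplets, threshold_imp=0.8):
    """Average per-triplet score: high when the whole relation occurs in the profile."""
    if not doc_triplets:
        return 0.0
    total = sum(
        threshold_imp if triplet in profile_triplets else 1 - threshold_imp
        for triplet in doc_triplets
    )
    return total / len(doc_triplets)


doc = [("Halfback", "Part_of", "Football"), ("Football", "Subclass", "Sports")]
profile_concepts = {"Halfback", "Football", "Sports"}
profile_triplets = {("Football", "Subclass", "Sports")}

print(explicit_score(doc, profile_concepts))   # both triplets match on concepts
print(implicit_score(doc, profile_triplets))   # only one full relation matches
```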


3.3.6.3 Semantics-based measures 
Semantic features of textual resources are the most informative portion of
the information content. Computing the amount of commonalities and/or
differences in semantic features between two semantic networks can be a
good indicator of the similarity between them.


WordNet-based semantic similarity measure: This method is based on the
notion of the Information Content (IC) of the Least Common Subsumer (LCS)
[58]. IC is a measure of the specificity of a concept, and the LCS of
concepts A and B is the most specific concept that is an ancestor of both A
and B. Higher commonalities in semantic features indicate a higher
similarity score. This method is called the normalized Jiang and Conrath
measure and is calculated as follows [74]:

$$WordNet_{score}(A, B) = 1 - \frac{IC_{nrm}(A) + IC_{nrm}(B) - 2 \times IC_{nrm}(LCS(A, B))}{2} \tag{12}$$
Wikipedia-based semantic similarity measure: It computes the semantic
similarity between two concepts based on the commonalities and differences
in their respective second-order and first-order vectors. For this purpose,
Lin's information theoretic measure [31, 40] is utilized:

$$Wiki_{score}(A, B) = \frac{\sum_{({*rel},\, {*w})} \left[ freq(A, {*rel}, {*w}) + freq(B, {*rel}, {*w}) \right]}{\sum_{({*rel},\, {*w})} freq(A, {*rel}, B) + \sum_{({*rel},\, {*w})} freq(B, {*rel}, A)} \tag{13}$$

* rel = {co-occurrence relation, contextually_similar relation}
* w = {concepts in either the document or user profile}

Where A and B are concepts in the document and the user query,
respectively. The freq() function calculates the frequency of A and/or B in
these relations. According to Lin's information theoretic measure, the
similarity between concepts A and B is related to the commonalities and
differences between them. A higher level of commonality means a higher
similarity score for the two concepts. This measure is somewhat similar to
latent semantic analysis, especially the one applied in [75].

For the WordNet-based and Wikipedia-based measures, the notion of semantic
similarity between two concepts is used to compute the similarity between
the user profile semantic network and the semantic network of documents.
These two methods compute the semantic similarity between all the possible
pairs of concepts in the document semantic network and the user profile
semantic network and generate a number in (0, 1), which indicates the
similarity score.

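Finally, to compute the overall semantic similarity, a linear and weighted
combination of these measures is used as follows:

$$\begin{aligned}
Similarity_{score}(SN(d_i), UP) ={}& k_1 \times Score_{discriminity}(SN(d_i), UP) \\
&+ k_2 \times Score_{Explicit}\Big(\bigcup_{\forall t_j \in d_i,\ \forall t_k \in d_i} (t_j, rel, t_k) \in SN(d_i)\Big) \\
&+ k_3 \times Score_{implicit}\Big(\bigcup_{\forall t_j \in d_i,\ \forall t_k \in d_i} (t_j, rel, t_k) \in SN(d_i)\Big) \\
&+ k_4 \times \frac{\sum_{\forall A \in d_i} Sal_{score}(A) \times WordNet_{score}(A, B)}{\sum_{\forall A \in d_i} Sal_{score}(A)} \\
&+ k_5 \times \frac{\sum_{\forall A \in d_i} Sal_{score}(A) \times Wiki_{score}(A, B)}{\sum_{\forall A \in d_i} Sal_{score}(A)}
\end{aligned} \tag{14}$$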


Input: the preliminary simple indexes D = {D1, D2, ..., Dn}, user queries and previously accessed documents
• Loop: for each document previously accessed by the user
    • Construct the semantic network, combine it with the other networks and construct the user profile
• End of Loop
• Loop: for each document in D
    • Retrieve the preliminary indexes most similar to the user query.
• End of Loop
• Loop: for each retrieved document
    • Construct the semantic network
• End of Loop
• Compute the semantic similarity between the document semantic networks and the user profile semantic network.
• Rank the documents according to the similarity of their semantic network to the user profile semantic network.
Output: Retrieved documents based on the user preferences.


Figure 8. The pseudo-code for the semantic indexing and retrieval system


Where $k_1$, $k_2$, $k_3$, $k_4$ and $k_5$ are the weighting parameters in
(0, 1), and their sum is equal to 1. In the final step, documents are
ranked according to their similarity with the user profile and the results
are displayed to the user. Figure 8 illustrates the proposed semantic
information indexing and management algorithm.
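
The weighted combination of Eq. (14) can be sketched as follows. This is only an illustration under assumptions: the function names, the component keys and the equal weights of 0.2 are placeholders, and the per-concept WordNet and Wikipedia similarities are assumed to have been aggregated over the user profile concepts beforehand.

```python
import math


def salience_weighted_average(similarity, salience):
    """sum_A Sal(A) * sim(A) / sum_A Sal(A) over the concepts A of one document."""
    total = sum(salience.values())
    if total == 0:
        return 0.0
    return sum(salience[a] * similarity.get(a, 0.0) for a in salience) / total


def overall_similarity(components, weights):
    """Weighted linear combination of the five component scores."""
    if set(components) != set(weights):
        raise ValueError("components and weights must use the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * components[name] for name in components)


salience = {"football": 0.6, "season": 0.2}
wordnet_sim = {"football": 0.9, "season": 0.5}
wiki_sim = {"football": 0.7, "season": 0.3}

components = {
    "discriminity": 0.4,
    "explicit": 0.5,
    "implicit": 0.5,
    "wordnet": salience_weighted_average(wordnet_sim, salience),
    "wiki": salience_weighted_average(wiki_sim, salience),
}
weights = {name: 0.2 for name in components}  # placeholder values for k1..k5
score = overall_similarity(components, weights)
print(round(score, 4))
```

Documents would then be ranked in descending order of this score.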


4. Evaluation 

4.1 The experiments and the data collection process 

In order to evaluate the proposed method, a series of three experiments
with different parameters is considered. The first experiment evaluates the
accuracy of the semantic network modelling modules and the semantic
similarity computation modules. It also measures their effect on the
overall performance of the proposed method. The second experiment evaluates
the efficiency of the system in identifying the documents most similar to
the user preferences. The third experiment evaluates the effectiveness of
the proposed method in predicting the correct topic classification of
documents.
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Table 2. The topics and their constituent classes in the 20 
newsgroup dataset

# | Topics | The constituent classes
1 | Computer | Comp.graphics, Comp.os.ms-windows.misc, Comp.sys.ibm.pc.hardware, Comp.sys.mac.hardware, Comp.windows.x
2 | Recreation | Rec.autos, Rec.motorcycles, Rec.sport.hockey, Rec.sport.baseball
3 | Science | Sci.crypt, Sci.electronics, Sci.med, Sci.space
4 | MISC | Misc.forsale
5 | Politics | Talk.politics.guns, Talk.politics.mideast, Talk.politics.misc
6 | Religion | Talk.religion.misc, Alt.atheism, Soc.religion.christian


One of the most important and widely used text datasets in the field of
text mining and related applications is the 20Newsgroup dataset [76]. This
dataset consists of 19997 news articles and web pages categorized in twenty
different classes (or newsgroups). In the last update (released in 2008),
several existing duplications were removed; thus, the number of unique
documents in the dataset was reduced to 18827. Consequently, the number of
unique concepts/words which occur more than once in the dataset is equal to
71830. Since some of the classes in the dataset are contextually related to
each other, the documents can be classified in broader categories called
topics. Table 2 illustrates the topics and their constituent classes in the
20Newsgroup dataset.


4.2 The evaluation of the semantic network generation and 
semantic similarity modules 

For performance evaluation, a set of 4000 documents of 20Newsgroup was
selected randomly. These documents were categorized in five different
topics. Out of the 4000, 800 documents were in the “science” topic, 800
were in “computer”, 800 were in “politics”, 800 were in “religion”, and 800
were in the “recreation” topic.

Furthermore, 10 different tests were designed to evaluate the performance
and precision of the proposed method. In other words, for each topic in the
dataset, two experiments are designed. These tests are designed to evaluate
two important components of the system: (1) the semantic network generation
modules, and (2) the semantic similarity modules. Also, the performance of
the proposed method is compared to other similar approaches.

The details of the designed tests are described here. For each test, 800
documents out of 4000 are selected randomly from the respective topic. The
remaining 3200 documents are selected from the other four topics, which are
completely irrelevant to the respective topic.

For each topic, two different queries are extracted from the documents in
the respective topic. In order to do this, the documents in each topic are
analyzed to identify the most frequent and informative concepts/words.
Then, a list of candidate concepts/words is formed based on this analysis
and presented to the experts. The experts select the concepts/words that
describe the underlying topic best, and the preliminary queries for each
topic are formed. In the next step, all the queries are enriched. The main
objective of these tests is to evaluate the semantic indexing and
management capabilities of the proposed system when faced with different
queries. In addition, two documents, which are deemed most similar to the
subject of the underlying topic by the experts, are selected to act as the
user's previously accessed documents.

In this paper, Mean Average Precision (MAP) is used to evaluate the
performance of the proposed approach. The MAP value is the arithmetic mean
of the average precision values for each query [54]. MAP is also known for
its good discrimination and robustness [77]. For a given query $q_i$, the
MAP value is calculated as follows:

$$MAP_i = \frac{1}{m} \sum_{k=1}^{m} Precision(R_k) \tag{15}$$

Where $m$ is the number of the retrieved documents and $R_k$ is the set of
ranked retrieval results from the top until the k-th retrieved document.
Table 3 depicts the queries for each test.
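
A small Python sketch of Eq. (15) is given below. It follows the definition in the text literally, averaging Precision(R_k) over all m retrieved documents; the function names and the boolean input format are assumptions made for the example. The more common formulation of average precision evaluates the precision only at the ranks of relevant documents, which would give different values.

```python
def average_precision(ranked_relevance):
    """Mean of Precision(R_k) for k = 1..m, where m is the number of retrieved documents.

    ranked_relevance: list of booleans in rank order; True marks a relevant document.
    """
    m = len(ranked_relevance)
    if m == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, relevant in enumerate(ranked_relevance, start=1):
        hits += bool(relevant)
        total += hits / k  # Precision(R_k): relevant documents among the top k
    return total / m


def mean_average_precision(runs):
    """Arithmetic mean of the per-query values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


query_1 = [True, False, True, True]
query_2 = [False, True]
print(round(average_precision(query_1), 4))
print(round(mean_average_precision([query_1, query_2]), 4))
```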


4.2.1 Evaluating the semantic network generation process 
When document semantic networks are generated, the underlying assumption
for computing the salient score of each concept is that the information
content of the documents can be represented more efficiently by the portion
of concepts that are most informative. In order to verify this assumption,
different percentages of the salient concepts are utilized to generate the
semantic networks. Then, the semantic similarity between the document
semantic networks and the user profiles is measured. In the end, the
precision and performance of the proposed method are evaluated using the
average MAP score of the 10 queries. Also, in order to demonstrate that the
salient score achieves much better results than the CF-IDF weighting
method, the proposed method with the salient score is compared with the
proposed method using CF-IDF instead of the salient score. The results are
illustrated in Figures 9 and 10.
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Figure 9. Evaluation of term selection measure 


Table 3. Queries for each test

# | # of related documents | The constituent keywords of each query | The appended concepts after the enrichment (Expanded Query) | Class
1 | 800 | Bible, Christian, belief, theist, heaven, faith, History | Gospel, Quran, ideology, doctrine, agnostic, hell, soul | Religion
2 | 800 | God, Atheism, religion, Jesus, devil | Worship, Christ, faith, Christianity, theism | Religion
3 | 800 | Car, bike, auto, vehicle, Brake, oil, driver | Truck, engine, automobile, steer, gear, fuel, rider | Recreation
4 | 800 | season, hit, pitch, team, score, catcher, baseball | Softball, varsity, pitcher, baseman, Yankees, infielder | Recreation
5 | 800 | Space, launch, technology, orbit, satellite, research, milky way | Spacecraft, NASA, system, engineering, innovation, space shuttle, radar, transponder | Science
6 | 800 | Health, patient, medical, clinical, disease, diagnosis | Mental health, education, nutrition, symptom, treatment | Science
7 | 800 | Gun, e violence, rifle, rug, victim | Conflict, racism, riot, weapon, ammunition, law enforcement | Politics
8 | 800 | Military, war, government, building, assault, crowd | Civilian, air force, naval, troop, invasion, army | Politics
9 | 800 | Software, graphic, render, shader, display, interface | Computer, open source, processor, polygon, image, VGA | Computer
10 | 800 | Computer, system, hardware, device, storage, driver | Processor, computer, decoder, server, data, disk, application | Computer


Journal of Computer and Knowledge Engineering, Vol. 3, No. 2, 2020. 


[Figure 10: line chart titled "Evaluation of Enrichment Process". Vertical axis: MAP. Horizontal axis: top n% concepts selected to generate semantic networks (10% to 100%). Series: Proposed; Proposed_without enrichment.]

Figure 10. Evaluation of enrichment process


In the next step, we evaluate whether the enrichment process has any effect
on the generated semantic networks and consequently on the accuracy of the
final results. Therefore, the proposed method with the enrichment module
and the proposed method without the enrichment module are compared and
evaluated.

As illustrated in Figure 10, the underlying assumption holds true again and
the documents can be represented more efficiently by the top 50% of the
prominent concepts. Also, the proposed method coupled with the enrichment
module performs far better than the proposed method without the enrichment
module. As mentioned earlier, one of the most important components of the
proposed method is the enrichment module. The content enrichment component
helps the system identify the Liaison features that link the separated
concept clusters, which forms a connected semantic network.


4.2.2 Evaluating the performance and the efficiency of the 
semantic similarity measure 

In this section, a series of tests are designed to evaluate the 
performance and efficiency of the proposed semantic 
similarity module. Five tests for different settings are 
prepared to determine the effectiveness of different 
components of the proposed hybrid similarity measure and 
their effect on overall precision. This is done by computing 
the average MAP scores of the queries. The settings of the 
designed tests are illustrated in Table 4. 


[Figure 11: bar chart titled "Evaluation of similarity measures". Bars: Proposed - DiscriminityScore; Proposed - ExplicitScore; Proposed - WordNetScore; Proposed - Wikiscore; Proposed.]

Figure 11. Evaluation of relevance/similarity measures


The illustrated results suggest that among the contributing components of
the similarity module, the Wikipedia-based component (Wikiscore) has the
greatest effect on the precision and efficiency of the proposed method. The
relation-based components (explicit score and implicit score), the
WordNet-based component (WordNet score) and the structural component
(Discriminity score) follow, in that order. The results were somewhat
expected, as the semantic constructs and relation-based structure of the
ontology, WordNet and Wikipedia KBs make them well suited to computing the
semantic similarity.


Table 4. The settings of the designed test for evaluating the effectiveness of the proposed hybrid similarity measure

# | Experiment | Description of the Experiment
1 | Proposed | The complete process of semantic similarity
2 | Proposed - Wikiscore | The process of semantic similarity without Wikiscore
3 | Proposed - WordNetScore | The process of semantic similarity without WordNetScore
4 | Proposed - ExplicitScore | The process of semantic similarity without ExplicitScore
5 | Proposed - ImplicitScore | The process of semantic similarity without ImplicitScore
6 | Proposed - DiscriminityScore | The process of semantic similarity without DiscriminityScore


4.2.3 Exploring other existing approaches 

Two existing approaches are selected for comparison; we 
implemented both based on the descriptions of their ranking 
formulas. The first is the Lucene scoring function [61], 
which combines the vector space model with the Boolean 
matching model to rank and retrieve the documents most 
similar to the user preferences. More details about the 
Lucene scoring function are given in [61, 54]. The second 
approach, proposed by Daoud et al. [18], is a personalized 
ontology-based ranking method. Documents and user profiles 
are represented by graph structures, and the relations 
between the concepts are established using a web ontology. 
A graph-based distance measure then computes the similarity 
between the document graphs and the user profile graph. More 
details about this ranking method are given in [18]. We 
also attempted to compare the performance of the proposed 
method with the relation-based ranking approach of another 
study [54]. However, since the ontology used there is a 
modified one that is not publicly available, that comparison 
was not possible in this study. The comparison results are 
obtained by computing the average MAP score over 10 queries 
and are illustrated in Figure 12. 

Figure 12. Comparison of the proposed method with similar 
methods 
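
As a reminder of how the MAP score used in this comparison is computed, the following minimal sketch averages the precision at each rank where a relevant document appears and then averages over queries. It assumes binary relevance judgements and normalizes by the number of relevant documents that appear in the ranked list; the two example rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance judgements."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over all queries."""
    return sum(average_precision(run) for run in runs) / len(runs)

# Two hypothetical ranked result lists (1 = relevant, 0 = irrelevant).
runs = [[1, 0, 1, 0], [0, 1, 1]]
print(round(mean_average_precision(runs), 4))
```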


As illustrated in Figure 12, the proposed method 
outperforms the other ranking and retrieval methods and 
achieves high precision. An interesting detail in these 
results is the relatively good performance of the Lucene 
method on the “Recreation” topic. This suggests that the 
Lucene ranking method performs well when most of the 
relevant documents can be retrieved using only the keywords 
they contain. 


4.3 Evaluating the performance of the system in topic 
classification 

In the next step, the capability of the proposed method in 
topic classification of documents is assessed. To this end, 
five tests are prepared, each designed to evaluate how well 
the proposed method classifies documents into the correct 
topic. For this purpose, 10000 documents are selected 
randomly from the 20Newsgroup dataset: 2000 documents from 
each of the “computer”, “religion”, “politics”, 
“recreation” and “science” topics. The first test assesses 
the performance of the proposed method in classifying 
documents from the “computer” topic. In this test, the 
selected documents from the “computer” topic are labelled 
“relevant” and the 8000 remaining documents from the other 
topics are labelled “irrelevant” to the “computer” topic. 
The second, third, fourth and fifth tests evaluate the 
performance of the proposed method in classifying the 
documents from the “religion”, “politics”, “recreation” and 
“science” topics, respectively. To generate the user 
profiles that reflect the information content of each topic, 
the same procedure applied in section 6.2 is used here. The 
designed queries for each topic are listed in Table 5. 

It should be noted that only the Wikipedia-based enriched 
concepts are displayed in this table; other enriched 
information is not displayed. 

The evaluation mechanism in this step is described below. 
First, for each topic in the dataset, the semantic network of 
the user profile is created according to the user queries and 
the user’s previously accessed documents. Next, the semantic 
similarities between the semantic networks of the documents 
and the user profiles are computed. This process results in a 
set of five similarity scores for each document in the test 
dataset; each score indicates the degree of similarity between 
the document and one of the topics in the dataset. Each 
document is then assigned to the topic with the highest 
semantic similarity score. According to the obtained results, 
the documents are labelled as TP (true positive), TN (true 
negative), FP (false positive) or FN (false negative). 
Finally, the performance of the system in topic 
classification is measured using accuracy, precision, recall 
and F-measure (Eqs. 17-20). 
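
The classification and labelling steps described above can be sketched as follows. This is a minimal illustration, assuming that the five similarity scores of a document are available as a dictionary keyed by topic; the function names and the example scores are hypothetical and are not taken from the experiments.

```python
from collections import Counter

def classify(similarities):
    """Return the topic whose profile has the highest similarity score."""
    return max(similarities, key=similarities.get)

def confusion_counts(true_topics, predicted_topics, target):
    """Count TP, TN, FP and FN for one topic treated as 'relevant'."""
    counts = Counter(TP=0, TN=0, FP=0, FN=0)
    for truth, predicted in zip(true_topics, predicted_topics):
        if predicted == target:
            counts["TP" if truth == target else "FP"] += 1
        else:
            counts["FN" if truth == target else "TN"] += 1
    return counts

# Hypothetical similarity scores for three documents (not from the paper).
documents = [
    ("computer", {"computer": 0.9, "religion": 0.1, "politics": 0.2,
                  "recreation": 0.3, "science": 0.4}),
    ("science",  {"computer": 0.6, "religion": 0.1, "politics": 0.2,
                  "recreation": 0.3, "science": 0.5}),
    ("religion", {"computer": 0.2, "religion": 0.8, "politics": 0.3,
                  "recreation": 0.1, "science": 0.2}),
]
true_topics = [truth for truth, _ in documents]
predicted_topics = [classify(scores) for _, scores in documents]
computer_counts = confusion_counts(true_topics, predicted_topics, "computer")
print(predicted_topics, dict(computer_counts))
```

Here the second document, whose true topic is “science”, receives its highest score for “computer” and is therefore counted as a false positive in the “computer” test.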

The evaluation results are given in Tables 7 and 8 
below. 


Table 5. The queries needed to retrieve the most similar documents for each topic 

#  Documents  Class       Constituent concepts of each query                                Appended concepts after the enrichment (expanded query)
1  2000       Religion    Bible, Christian, belief, theist, heaven, faith, History          Gospel, Quran, ideology, doctrine, agnostic, hell, soul
                          God, Atheism, religion, Jesus, devil                              Worship, Christ, faith, Christianity, theism
2  2000       Recreation  Car, bike, auto, vehicle, Brake, oil, driver                      Truck, engine, automobile, steer, gear, fuel, rider
                          season, hit, pitch, team, score, catcher, baseball                Softball, varsity, pitcher, baseman, Yankees, infielder
3  2000       Science     Space, launch, technology, orbit, satellite, research, milky way  Spacecraft, NASA, system, engineering, innovation, space shuttle, radar, transponder
                          Health, patient, medical, clinical, disease, diagnosis            Mental health, education, nutrition, symptom, treatment
4  2000       Politics    Gun, police, violence, rifle, drug, victim                        Conflict, racism, riot, weapon, ammunition, law enforcement
                          Military, war, government, building, assault, crowd               Civilian, air force, naval, troop, invasion, army
5  2000       Computer    Software, graphic, render, shader, display, interface             Computer, open source, processor, polygon, image, VGA
                          Computer, system, hardware, device, storage, driver               Processor, computer, decoder, server, data, disk, application

Table 6. Evaluating the performance of the system 

                True label
Selected as     Relevant          Irrelevant
Relevant        True Positive     False Positive
Irrelevant      False Negative    True Negative

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{17}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{18}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{19}$$

$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{20}$$
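
As a quick check of Eqs. (17)-(20), the following minimal sketch computes the four measures from the TP, TN, FP and FN counts. The function name is ours, and the sketch assumes non-empty classes (no guard against division by zero). The counts used are those reported for Test #1 (“computer”) in Table 7.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure as in Eqs. (17)-(20)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Counts reported for Test #1 ("computer") in Table 7.
acc, prec, rec, f1 = classification_metrics(tp=1936, tn=7958, fp=42, fn=64)
print(f"{acc:.2%} {prec:.2%} {rec:.2%} {f1:.2%}")
# prints: 98.94% 97.88% 96.80% 97.34%
```

These values agree with the first row of Table 8.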


Table 7. The evaluation results in terms of TP, TN, FP, and FN 

Test     Topic       TP    TN    FP   FN
Test #1  Computer    1936  7958  42   64
Test #2  Religion    1959  7909  91   41
Test #3  Politics    1903  7926  74   97
Test #4  Recreation  1877  7784  216  123
Test #5  Science     1782  7634  366  218

Table 8. The evaluation results in terms of accuracy, precision, recall and F-measure 

Test     Topic       Accuracy  Precision  Recall  F-measure
Test #1  Computer    98.94%    97.88%     96.8%   97.34%
Test #2  Religion    98.68%    95.56%     97.95%  96.74%
Test #3  Politics    98.29%    96.26%     95.15%  95.7%
Test #4  Recreation  96.61%    89.68%     93.85%  91.72%
Test #5  Science     94.16%    82.96%     89.1%   85.92%
Mean performance     97.34%    92.47%     95.09%  93.48%

Figure 13. The evaluation results (performance of the proposed method: accuracy, precision, recall and F-measure for each topic) 

Table 9. The results of hypothesis testing for each topic in the test dataset 
(number of relevant/irrelevant documents in each topic: 2000/8000; significance level = 5% (0.05)) 

Test     Topic       Observed mean  Standard deviation  p-value  Null hypothesis
Test #1  Computer    -0.6040        0.7974              0.9109   Accepted
Test #2  Religion    -0.5940        0.8049              0.8673   Accepted
Test #3  Politics    -0.5980        0.8019              0.9555   Accepted
Test #4  Recreation  -0.5740        0.8193              0.4729   Accepted
Test #5  Science     -0.5660        0.8248              0.3497   Accepted


As illustrated, the proposed method exhibits high accuracy, 
recall and precision when classifying documents in the 
“computer”, “religion” and “politics” topics. The lowest 
accuracy, precision and recall are recorded for the 
“science” and “recreation” topics (although the recorded 
recall and accuracy are still quite suitable for text mining 
applications). After investigating the documents in these 
topics, we found that their content is relatively diverse, 
and we concluded that this diversity might be the reason for 
these results. In other words, when the documents of a topic 
discuss subjects related to other topics, lower results 
(especially lower precision) can be expected. 


4.4 The evaluation of the effectiveness and reliability of the 
proposed method in predicting the correct topic 
classification of documents 

In the last step, the effectiveness of the proposed method in 
predicting the correct topic classification of the documents 
in each topic is assessed through hypothesis testing. In 
this step, 10000 documents from the 20Newsgroup dataset are 
randomly selected, with 2000 documents per topic. The 
hypothesis testing procedure is explained for one topic; 
the other topics are tested in the same way. Assuming that 
the user preferences are closely related to the content of 
the “computer” topic, the semantic network of the user 
profile is created using two documents from this topic that 
reflect its information content very well. Next, the 
documents in the “computer” topic are assigned the true 
label “1” and the documents in the other topics are assigned 
the label “-1”. Then, the semantic similarity between each 
document and the user profile of each topic is computed. If 
the similarity of a given document to the “computer” topic 
is higher than its similarity to the other topics, the 
prediction label “1” is assigned to this document; 
otherwise, the prediction label “-1” is assigned. The 
assigned prediction labels act as the topic prediction for 
each document: if the true label of a document is equal to 
its prediction label, the document is classified in the 
correct topic; otherwise, its topic classification is 
incorrect. 


Hypothesis testing to evaluate the effectiveness of the 
proposed method in predicting the correct topic classification 
of documents in the “computer” topic: To conduct the 
hypothesis testing on the randomly sampled test data, a 
two-sample t-test is performed. It should be noted that the 
optimal value (correct prediction label) for documents 
relevant to the “computer” topic is 1 and the optimal value 
for irrelevant ones is -1. The mean and sample standard 
deviation of the computed prediction labels for the test 
documents are found to be -0.6040 and 0.7974, respectively. 
The purpose of the two-sample t-test is to test whether the 
means of two populations, namely the population of true 
labels and the population of prediction labels, are equal. 
The test used here does not assume equal variances. The null 
hypothesis is formulated as follows: 

H0: 

The data of both populations come from independent random 
samples of normal distributions with equal means, without 
assuming that the populations also have equal variances (i.e., 
the proposed method is capable and effective in predicting 
the correct topic classification of the documents in the 
“computer” topic). 

H1: 

The null hypothesis is rejected. That is, the proposed method 
is not capable and effective in predicting the correct topic 
classification of the documents, and the results may have 
been obtained by random chance in the sample selection 
process. 

The significance level is 5% (0.05). To assess whether 
the null hypothesis should be accepted or rejected, first we 
need to calculate the t-value: 
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{23}$$

In this equation, $\bar{x}_1$ and $\bar{x}_2$ are the sample 
means, $s_1$ and $s_2$ are the sample standard deviations, and 
$n_1$ and $n_2$ are the sample sizes. The following results are 
obtained for the hypothesis testing on the “computer” topic: 
p-value = 0.9109. As the p-value is greater than the 
significance level, the null hypothesis is accepted. In 
other words, the data are consistent with both populations 
having equal means. This suggests that the proposed method 
is in fact capable and effective in predicting the correct 
topic classification of the documents in the “computer” 
topic. 
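
The unequal-variance (Welch) t-test used above can be sketched from summary statistics alone. The following is a minimal illustration with several assumptions that are not stated in the text: a sample size of 1000 labels per population, true labels split 20%/80% between 1 and -1 (so that their mean is -0.6), and a normal approximation to the t-distribution for the p-value, which is reasonable only for large samples. The function names are ours.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def two_sided_p_normal(t):
    """Two-sided p-value under the normal approximation (large df only)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Assumed sample size (not stated in the text): 1000 labels per population.
n = 1000
# Assumed true labels: 20% are +1 and 80% are -1, so their mean is -0.6.
true_mean = 0.2 * 1 + 0.8 * -1
true_sd = math.sqrt((1 - true_mean ** 2) * n / (n - 1))

# Mean and standard deviation of the prediction labels as reported
# for the "computer" topic.
t, df = welch_t(true_mean, true_sd, n, -0.6040, 0.7974, n)
p = two_sided_p_normal(t)
print(round(t, 3), round(df), round(p, 3))
```

Under these assumptions the sketch gives a p-value of about 0.91, far above the 5% significance level, so the null hypothesis is not rejected.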

The results of the second, third, fourth and fifth tests are 
given in Table 9. 

The results in Table 9 suggest that the proposed method is 
both effective and reliable in identifying the correct topic 
classification of documents in each topic (i.e., the null 
hypothesis is accepted in all cases), and that the results 
have not been obtained by random chance during the sample 
selection process. 


5. Conclusion 

In this paper, a novel method of semantic information 
indexing and management is introduced. The method integrates 
the structured knowledge of ontology and KBs (especially 
Wikipedia and WordNet) in every one of its components. The 
documents and user profiles are represented by semantic 
network graphs. The main characteristics of the proposed 
method are the semantic, ambiguity-free and multi-level 
representation of the contents. In addition, the properties 
of semantic networks are applied to identify the documents 
similar to the user preferences. As mentioned earlier, the 
main contributions and novelty of the proposed method include 
(1) the integration of structured knowledge in every 
component of the system, (2) the use of semantic networks for 
a unified and multi-level representation of textual 
resources, (3) the introduction of a hybrid weighting schema 
called the salient score, and (4) the proposal of a hybrid 
semantic similarity measure. 

The proposed method is evaluated in three stages using 
the 20Newsgroup dataset. In the first stage, the different 
components of the proposed system and their effect on the 
overall performance and efficiency are evaluated. The 
evaluation results suggest that (1) employing a portion of the 
most prominent concepts (the top 50% with the highest salient 
score) to generate the semantic networks achieves the 
highest accuracy and precision, (2) the salient score 
(weighting schema) achieves better results than the CF-IDF 
weighting method, that is, a semantic and relation-based 
weighting schema results in higher precision than a 
term/frequency-based weighting method, (3) the enrichment 
module has a significant effect on the overall precision and 
performance of the proposed method and (4) among the 
contributing components of the hybrid similarity measurement, 
the 